Affordable Housing Distribution vs Low-Income Population
2.1 Distribution Comparison
The relationship between affordable housing distribution and low-income population distribution is a critical starting point for this analysis. Affordable housing is intended to address the needs of low-income populations, so comparing these distributions helps to identify whether housing resources are effectively located to serve those who need them most. Any misalignment between the two could indicate gaps in accessibility or inequities in resource allocation.
Code
import pandas as pd
import geopandas as gpd
import hvplot.pandas
import holoviews as hv
from shapely.geometry import Point

hv.extension('bokeh')

# Load datasets
affordable_housing_data = pd.read_csv("./geographicdatascience_python_finalproject/database/Affordable_Housing_Production_by_Building_20241224.csv")
income_data = pd.read_csv("./geographicdatascience_python_finalproject/database/Community_Development_Block_Grant__CDBG__Eligibility_by_Census_Tract_-_CSV_20241221.csv")
geo_data = gpd.read_file("./geographicdatascience_python_finalproject/database/nyc_tracts.json")

# Convert latitude and longitude to geometry for GeoDataFrame
affordable_housing_data["geometry"] = affordable_housing_data.apply(
    lambda row: Point(row["Longitude"], row["Latitude"]), axis=1
)

# Convert affordable housing data to GeoDataFrame
affordable_housing_gdf = gpd.GeoDataFrame(
    affordable_housing_data, geometry="geometry", crs="EPSG:4326"
)

# Ensure CRS matches between GeoDataFrames
affordable_housing_gdf = affordable_housing_gdf.to_crs(geo_data.crs)

# Perform spatial join between affordable housing data and census tracts
affordable_housing_with_tracts = gpd.sjoin(
    affordable_housing_gdf, geo_data, how="left", predicate="intersects"
)

# Summarize affordable housing count by census tract
housing_summary_by_tract = (
    affordable_housing_with_tracts.groupby("BoroCT2020")["Building ID"]
    .count()
    .reset_index()
)
housing_summary_by_tract.rename(columns={"Building ID": "AffordableHousingCount"}, inplace=True)

# Merge affordable housing summary into geo_data
geo_data = geo_data.merge(housing_summary_by_tract, on="BoroCT2020", how="left")
geo_data["AffordableHousingCount"] = geo_data["AffordableHousingCount"].fillna(0)

# Match income data with geo_data using BoroCT fields
income_data["BoroCT"] = income_data["BoroCT"].astype(str).str.zfill(11)
geo_data["BoroCT2020"] = geo_data["BoroCT2020"].astype(str).str.zfill(11)

# Merge low-income population data into geo_data
geo_data = geo_data.merge(
    income_data[["BoroCT", "LowMod_Population"]],
    left_on="BoroCT2020",
    right_on="BoroCT",
    how="left"
)
geo_data["LowMod_Population"] = geo_data["LowMod_Population"].fillna(0)

# Map Affordable Housing Distribution
map_affordable_housing = geo_data.hvplot.polygons(
    "geometry",
    color="AffordableHousingCount",  # Map AffordableHousingCount to color
    cmap="Reds",
    line_color="white",
    hover_cols=["BoroCT2020", "AffordableHousingCount"],  # Display BoroCT2020 and AffordableHousingCount on hover
    title="Affordable Housing Distribution by Census Tract",
    aspect='equal',
    clim=(0, 50),  # Set color range between 0 and 50
    clipping_colors={'max': 'darkred'},  # Values above 50 will be dark red
    colorbar=True
)

# Map Low-Income Population Overlay
map_low_income = geo_data.hvplot.polygons(
    "geometry",
    color="LowMod_Population",  # Map LowMod_Population to color
    cmap="Greens",
    line_color="white",
    hover_cols=["BoroCT2020", "LowMod_Population"],  # Display BoroCT2020 and LowMod_Population on hover
    title="Low-Income Population by Census Tract",
    aspect='equal',
    colorbar=True  # Display color bar
)

# Combine Maps for Visualization
(map_affordable_housing + map_low_income).cols(1)
2.2 K-Means Cluster Analysis for Income and Affordable Housing
First, we cleaned the data and excluded outliers so that extreme values would not distort the results. Then, based on the elbow method, we chose k = 3 as the number of clusters.
Code
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
import hvplot.pandas

plt.style.use('default')

# Step 1: Prepare the Data
# Select features for clustering
clustering_data = geo_data[["AffordableHousingCount", "LowMod_Population"]].copy()

# Step 2: Identify and Remove Outliers
# Calculate IQR for both features
Q1 = clustering_data.quantile(0.25)
Q3 = clustering_data.quantile(0.75)
IQR = Q3 - Q1

# Define outlier thresholds
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# Filter out outliers
non_outliers = ~((clustering_data < lower_bound) | (clustering_data > upper_bound)).any(axis=1)

# Filter geo_data to only include non-outliers
geo_data_filtered = geo_data[non_outliers].copy()
clustering_data_filtered = clustering_data[non_outliers].copy()

# Step 3: Normalize the Filtered Data
scaler = StandardScaler()
clustering_data_scaled_filtered = scaler.fit_transform(clustering_data_filtered)

# Step 4: Determine Optimal Number of Clusters
inertia = []
for n_clusters in range(2, 10):
    kmeans = KMeans(n_clusters=n_clusters, random_state=42)
    kmeans.fit(clustering_data_scaled_filtered)
    inertia.append(kmeans.inertia_)

# Plot the Elbow Curve
plt.figure(figsize=(8, 5))
plt.plot(range(2, 10), inertia, marker='o')
plt.title("Elbow Method for Optimal Clusters (Filtered Data)")
plt.xlabel("Number of Clusters")
plt.ylabel("Inertia")
plt.grid(True)
plt.show()
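The 1.5 × IQR rule used in Step 2 above can be illustrated on a small made-up series. This is a minimal standard-library sketch; the `counts` values and the helper name `iqr_bounds` are hypothetical and are not taken from the datasets.

```python
import statistics


def iqr_bounds(values, k=1.5):
    """Return the (lower, upper) fences Q1 - k*IQR and Q3 + k*IQR."""
    # method="inclusive" uses linear interpolation between data points,
    # the same convention as the default pandas quantile.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Hypothetical per-tract building counts with one extreme value.
counts = [0, 1, 2, 2, 3, 4, 5, 40]

lower, upper = iqr_bounds(counts)
kept = [v for v in counts if lower <= v <= upper]

print("fences:", lower, upper)
print("kept:", kept)
```

In the analysis code a tract is dropped if either feature falls outside its fences. When a feature is zero for most tracts, the IQR can collapse to 0 and every nonzero value then counts as an outlier, so it is worth inspecting the bounds before filtering.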
Code
# Step 5: Fit KMeans Model with Optimal Clusters
optimal_clusters = 3  # Adjust this value based on the Elbow Curve
kmeans_filtered = KMeans(n_clusters=optimal_clusters, random_state=42)
geo_data_filtered["Cluster"] = kmeans_filtered.fit_predict(clustering_data_scaled_filtered)
geo_data_filtered["Cluster"] = geo_data_filtered["Cluster"].astype(str)  # Convert to string for better visualization

# Step 6: Visualize Clustering Results
cluster_map_filtered = geo_data_filtered.hvplot.polygons(
    "geometry",
    color="Cluster",  # Use Cluster column for coloring
    cmap={  # Map each cluster to a specific color; keys are strings because Cluster was cast to str above
        "0": "#fffbdf",
        "1": "#ff6f64",
        "2": "#ffb164",
    },
    line_color="white",
    hover_cols=["AffordableHousingCount", "LowMod_Population", "Cluster"],
    title="Clustering Analysis of Affordable Housing and Low-Income Population (Filtered)",
    aspect='equal',
    colorbar=False
)

# Display the filtered map
cluster_map_filtered
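The cluster map shows a mix of dispersion and aggregation: tracts in the same cluster form contiguous groups in some areas and are scattered in others. This suggests that affordable housing and low-income populations tend to concentrate spatially within New York.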
Code
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler

# Step 1: Ensure the Cluster column is of integer type
geo_data_filtered["Cluster"] = geo_data_filtered["Cluster"].astype(int)

# Step 2: Define colors for clusters and create a mapping
cluster_colors_filtered = {
    0: "#fffbdf",
    1: "#ff6f64",
    2: "#ffb164"
}

# Step 3: Scaling setup (same as before)
scaler = StandardScaler()
scaler.fit(clustering_data_filtered)
centers_original = scaler.inverse_transform(kmeans_filtered.cluster_centers_)

# Step 4: Plot with explicit color mapping
plt.figure(figsize=(8, 5))
for cluster_id in range(optimal_clusters):
    cluster_points = geo_data_filtered[geo_data_filtered["Cluster"] == cluster_id]
    plt.scatter(
        cluster_points["AffordableHousingCount"],
        cluster_points["LowMod_Population"],
        label=f"Cluster {cluster_id}",
        color=cluster_colors_filtered[cluster_id],  # Use dictionary mapping
        alpha=0.8
    )

# Step 5: Plot cluster centers in the original feature space
plt.scatter(
    centers_original[:, 0],  # AffordableHousingCount
    centers_original[:, 1],  # LowMod_Population
    c="black",
    marker="X",
    s=250,  # Size of cluster center markers
    label="Cluster Centers"
)

# Step 6: Customize the plot
plt.title("KMeans Clustering (Original Feature Space - Filtered)", fontsize=18)
plt.xlabel("Affordable Housing Count", fontsize=14)
plt.ylabel("Low-Income Population", fontsize=14)
plt.grid(color='gray', linestyle='--', linewidth=0.5)

# Step 7: Adjust legend position
plt.legend(title="Clusters", fontsize=12, loc='upper right', bbox_to_anchor=(1.15, 1))

# Step 8: Display the plot
plt.tight_layout()
plt.show()
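Step 3 above maps the cluster centers from standardized units back to the original units. The arithmetic behind that round trip is z = (x - mean) / std and x = z * std + mean, with the population standard deviation that `StandardScaler` uses. A minimal standard-library sketch for one feature, where the `population` values and the helper names are hypothetical:

```python
import statistics


def fit_scaler(values):
    """Return the mean and population standard deviation (ddof = 0)."""
    return statistics.fmean(values), statistics.pstdev(values)


def transform(values, mean, std):
    """Standardize values to zero mean and unit variance."""
    return [(v - mean) / std for v in values]


def inverse_transform(scaled, mean, std):
    """Map standardized values back to the original units."""
    return [z * std + mean for z in scaled]


# Hypothetical low-income population values for four tracts.
population = [200.0, 400.0, 600.0, 800.0]

mean, std = fit_scaler(population)
scaled = transform(population, mean, std)
restored = inverse_transform(scaled, mean, std)

# A cluster center at z = 0 sits at the feature mean in original units.
center_original = inverse_transform([0.0], mean, std)[0]
print(mean, std, center_original)
```

Because the transform is linear, a center computed in standardized space keeps its position relative to the points once both are expressed in original units, which is why the centers can be drawn on the unscaled scatter plot.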
The scatter plot shows that the tracts fall into three groups, distinguished by the combined effect of low-income population and the number of affordable housing buildings. Next, we computed the linear (Pearson) correlation between low-income population and affordable housing count themselves. The following conclusions were obtained.
Code
from scipy.stats import pearsonr

# Use the filtered data for correlation analysis
correlation_filtered, p_value_filtered = pearsonr(
    geo_data_filtered["AffordableHousingCount"],
    geo_data_filtered["LowMod_Population"]
)
print(f"Filtered Correlation: {correlation_filtered}, P-value: {p_value_filtered}")
Surprisingly, according to the data, there is a weak but statistically significant positive relationship between low-income population and affordable housing. This suggests some alignment between housing needs and supply, but the weak correlation indicates other factors might dilute the relationship, such as zoning laws, policy gaps, or spatial mismatches. Therefore, our next step is to explore more factors.
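A weak correlation can still be statistically significant when the sample is large, because the test statistic grows with the number of observations. The sketch below uses the standard t statistic for testing zero correlation, t = r * sqrt((n - 2) / (1 - r^2)). The values of `r` and `n` are hypothetical and are not the results of this analysis.

```python
import math


def pearson_t_statistic(r, n):
    """t statistic for H0: no linear correlation, with n - 2 degrees of freedom."""
    return r * math.sqrt((n - 2) / (1 - r * r))


r = 0.10  # hypothetical weak correlation
sample_sizes = (50, 500, 2000)  # hypothetical numbers of tracts

t_values = {n: pearson_t_statistic(r, n) for n in sample_sizes}
for n, t in t_values.items():
    print(f"n = {n}: t = {t:.2f}")
```

Taking |t| above roughly 1.96 as the large-sample, two-sided 5% threshold, the same r = 0.10 is not significant at n = 50 but is at n = 500. A small p-value therefore says the relationship is unlikely to be zero, not that it is strong.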